```python
# load the dataset
from datasets import load_dataset

dataset = load_dataset('swag', 'regular')
```
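Before sampling anything, a quick look at what we just loaded (purely an inspection step):

```python
# the 'regular' config loads as a DatasetDict; check the available splits
print(dataset)
print(dataset['train'].num_rows)
```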
Introduction:

- In this notebook I will try to investigate the SWAG dataset.
- The idea is to understand how to deal with multiple choice datasets and how to prepare them for the next step.
- Multiple choice is a frequent problem in the field of LLMs and NLP in general.
- So the preprocessing of the data will have a huge effect on the success of any proposed solution.
```python
# let's grab a sample
dataset['train'][0]
```
```
{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}
```
- These fields represent the idea behind this dataset: a situation where we have to predict the right ending.
- `sent1` and `sent2` represent the given situation, and together they add up to `startphrase` (verified with a quick check below).
- `ending0` to `ending3` represent the candidate endings for that situation; only one is right.
- `label` indexes the right answer.
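Since `startphrase` should just be `sent1` and `sent2` joined together, here is a quick sanity check (my own addition; it assumes a single-space separator, which matches the sample above):

```python
# startphrase == sent1 + ' ' + sent2 for the sample we just printed
sample = dataset['train'][0]
assert sample['startphrase'] == sample['sent1'] + ' ' + sample['sent2']
```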
- Now let’s initialize BERT and load its tokenizer.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
```
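Before applying it to SWAG, it is worth recalling how a BERT tokenizer handles a sentence pair; a minimal illustration using strings from the sample above:

```python
# a pair is encoded as one sequence: [CLS] sent_a [SEP] sent_b [SEP];
# token_type_ids marks which segment each token belongs to
enc = tokenizer('A drum line', 'passes by walking down the street playing their instruments.')
print(tokenizer.decode(enc['input_ids']))
print(enc['token_type_ids'])  # 0s for the first segment, 1s for the second
```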
- The idea here is to tokenize a start sentence with each one of the 4 choices:
= ["ending0", "ending1", "ending2", "ending3"]
ending_names
def preprocess_function(examples):
= [[context] * 4 for context in examples["sent1"]]
first_sentences = examples["sent2"]
question_headers = [
second_sentences f"{header} {examples[end][i]}" for end in ending_names]
[for i, header in enumerate(question_headers)
]
= sum(first_sentences, [])
first_sentences = sum(second_sentences, [])
second_sentences
= tokenizer(first_sentences, second_sentences, truncation=True)
tokenized_examples return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
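To run this over the full dataset we would hand the function to `map()` in batched mode, since it expects a batch (a dict of lists) rather than a single example; a sketch of the call:

```python
# batched=True feeds the function batches of examples, as it assumes
tokenized_swag = dataset.map(preprocess_function, batched=True)
```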
To understand each operation of that function we will do it step-by-step:

- First let’s create a sub-set of the training set.
- Create an `endings` list that we will use later.
= ["ending0", "ending1", "ending2", "ending3"]
endings = dataset['train']
train_ds = train_ds[:20] smp
- Multiply each `sent1` by 4 and stack them all in a list:
```python
sent_1 = [[sent] * 4 for sent in smp['sent1']]
```
- Let’s retrieve the length of that list and see what’s inside one element of it.
```python
sent_1[2], len(sent_1)
```
```
(['A group of members in green uniforms walks waving flags.',
  'A group of members in green uniforms walks waving flags.',
  'A group of members in green uniforms walks waving flags.',
  'A group of members in green uniforms walks waving flags.'],
 20)
```
- So basically we have 4 copies of each first sentence of the dataset.
- Now we will create a list of the second sentences, i.e. the headers.
```python
headers = smp['sent2']
```
- At this point we have:
  - `sent_1`, in which each element is repeated 4 times
  - `headers`, which complete `sent_1`
- The idea here is to create pairs of each `header` + ending, giving `sent_2` for each `sent_1`.
```python
sent_2 = [[f'{head} {smp[end][i]}' for end in endings] for i, head in enumerate(headers)]
```
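Each element of `sent_2` now holds the 4 candidate continuations of the corresponding sample; a shape check:

```python
# 20 samples, each paired with its 4 header+ending strings
len(sent_2), len(sent_2[2])  # -> (20, 4)
```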
- Now we need to flatten the pairs of sentences so we can tokenize them:
```python
frst_sent = sum(sent_1, [])
scnd_sent = sum(sent_2, [])
tok_smp = tokenizer(frst_sent, scnd_sent, truncation=True)
```
- We tokenize the paired lists of sentences, which returns a dictionary with 3 keys:
```python
tok_smp.keys()
```

```
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```
- But since we already flattened the pairs before the tokenization step, we need to unflatten them again so we can pass them through the `map()` function in order to be computed by the model.
```python
outputs = {k: [v[i: i + 4] for i in range(0, len(v), 4)] for k, v in tok_smp.items()}
```
- Let’s check if we got the unflattening step right; we just need to make sure that the `input_ids` of the first sample have the same values in both `tok_smp` and `outputs`:
```python
flatten_smp = tok_smp['input_ids']
unflatten_smp = outputs['input_ids']
```
```python
flatten_smp[0:4] == unflatten_smp[0]
```

```
True
```
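As a final check, we can decode one tokenized pair back into text to see exactly what BERT will receive (the `[CLS]`/`[SEP]` layout is standard for BERT pairs; the exact string depends on the sample):

```python
# decode the first ending-pair of the first sample:
# [CLS] first sentence [SEP] header + ending [SEP] (lowercased by the uncased tokenizer)
print(tokenizer.decode(outputs['input_ids'][0][0]))
```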